MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models
However, there is little to no understanding of the notion of medical safety in the context of LLMs, let alone how to evaluate and improve it. To address this gap, we first define the notion of medical safety in LLMs based on the Principles of Medical Ethics set forth by the American Medical Association.
- Europe > United Kingdom (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Virginia > Albemarle County > Charlottesville (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
AI's safety features can be circumvented with poetry, research finds
Roses are red, violets are blue, how do you make a nuclear bomb? Poetry can be linguistically and structurally unpredictable - and that's part of its joy. But one man's joy, it turns out, can be a nightmare for AI models. Those are the recent findings of researchers out of Italy's Icaro Lab, an initiative from a small ethical AI company called DexAI.
- Europe > Italy (0.25)
- North America > United States (0.17)
- Europe > Ukraine (0.06)
- Oceania > Australia (0.05)
- Leisure & Entertainment > Sports (0.72)
- Government > Regional Government (0.51)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Communications > Social Media (0.74)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.72)
SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models
Giorgio Piras, Raffaele Mura, Fabio Brau, Luca Oneto, Fabio Roli, Battista Biggio
Refusal refers to the functional behavior enabling safety-aligned language models to reject harmful or unethical prompts. Following the growing scientific interest in mechanistic interpretability, recent work encoded refusal behavior as a single direction in the model's latent space; e.g., computed as the difference between the centroids of harmful and harmless prompt representations. However, emerging evidence suggests that concepts in LLMs often appear to be encoded as a low-dimensional manifold embedded in the high-dimensional latent space. Motivated by these findings, we propose a novel method leveraging Self-Organizing Maps (SOMs) to extract multiple refusal directions. To this end, we first prove that SOMs generalize the prior work's difference-in-means technique. We then train SOMs on harmful prompt representations to identify multiple neurons. By subtracting the centroid of harmless representations from each neuron, we derive a set of multiple directions expressing the refusal concept. We validate our method on an extensive experimental setup, demonstrating that ablating multiple directions from models' internals outperforms not only the single-direction baseline but also specialized jailbreak algorithms, leading to an effective suppression of refusal. Finally, we conclude by analyzing the mechanistic implications of our approach.
- Research Report > Promising Solution (0.48)
- Research Report > New Finding (0.46)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Law (0.93)
- Government (0.93)
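The difference-in-means construction this work generalizes, and the projection-style ablation it evaluates, can be made concrete in a few lines. The sketch below is an illustration under assumed shapes and names (per-layer activation tensors, unit-normalized directions), not the authors' code; som_neurons stands in for whatever weight vectors a trained SOM would provide.

    import torch

    def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
        """Single-direction baseline: difference between harmful and harmless centroids."""
        # harmful_acts, harmless_acts: (num_prompts, hidden_dim) activations from one layer.
        d = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
        return d / d.norm()

    def som_refusal_directions(som_neurons: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
        """Multi-direction variant: subtract the harmless centroid from each SOM neuron."""
        # som_neurons: (num_neurons, hidden_dim) weight vectors of a SOM trained on harmful prompts.
        dirs = som_neurons - harmless_acts.mean(dim=0)
        return dirs / dirs.norm(dim=-1, keepdim=True)

    def ablate(hidden: torch.Tensor, directions: torch.Tensor) -> torch.Tensor:
        """Project each direction out of a hidden state (refusal suppression by ablation)."""
        # hidden: (batch, hidden_dim); directions: (num_directions, hidden_dim), unit norm.
        for d in directions:
            hidden = hidden - (hidden @ d).unsqueeze(-1) * d
        return hidden

Ablating the single baseline direction corresponds to passing refusal_direction(...).unsqueeze(0); the multi-direction case passes the full set from som_refusal_directions.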
Friend or Foe: How LLMs' Safety Mind Gets Fooled by Intent Shift Attack
Peng Ding, Jun Kuang, Wen Sun, Zongyu Wang, Xuezhi Cao, Xunliang Cai, Jiajun Chen, Shujian Huang
Large language models (LLMs) remain vulnerable to jailbreaking attacks despite their impressive capabilities. Investigating these weaknesses is crucial for robust safety mechanisms. Existing attacks primarily distract LLMs by introducing additional context or adversarial tokens, leaving the core harmful intent unchanged. In this paper, we introduce ISA (Intent Shift Attack), which obscures the intent of the attack from LLMs. More specifically, we establish a taxonomy of intent transformations and leverage them to generate attacks that may be misperceived by LLMs as benign requests for information. Unlike prior methods relying on complex tokens or lengthy context, our approach needs only minimal edits to the original request and yields natural, human-readable, and seemingly harmless prompts. Extensive experiments on both open-source and commercial LLMs show that ISA achieves over 70% improvement in attack success rate compared to direct harmful prompts. More critically, fine-tuning models on only benign data reformulated with ISA templates elevates success rates to nearly 100%. For defense, we evaluate existing methods and demonstrate their inadequacy against ISA, while exploring both training-free and training-based mitigation strategies. Our findings reveal fundamental challenges in intent inference for LLM safety and underscore the need for more effective defenses. Our code and datasets are available at https://github.com/NJUNLP/ISA.
- North America > United States (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- (3 more...)
LatentBreak: Jailbreaking Large Language Models through Latent Space Feedback
Raffaele Mura, Giorgio Piras, Kamilė Lukošiūtė, Maura Pintor, Amin Karbasi, Battista Biggio
Jailbreaks are adversarial attacks designed to bypass the built-in safety mechanisms of large language models. Automated jailbreaks typically optimize an adversarial suffix or adapt long prompt templates by forcing the model to generate the initial part of a restricted or harmful response. In this work, we show that existing jailbreak attacks that leverage such mechanisms to unlock the model response can be detected by a straightforward perplexity-based filtering on the input prompt. To overcome this issue, we propose LatentBreak, a white-box jailbreak attack that generates natural adversarial prompts with low perplexity capable of evading such defenses. LatentBreak substitutes words in the input prompt with semantically-equivalent ones, preserving the initial intent of the prompt, instead of adding high-perplexity adversarial suffixes or long templates. These words are chosen by minimizing the distance in the latent space between the representation of the adversarial prompt and that of harmless requests. Our extensive evaluation shows that LatentBreak leads to shorter and low-perplexity prompts, thus outperforming competing jailbreak algorithms against perplexity-based filters on multiple safety-aligned models.
- North America > United States (0.46)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Oceania > Australia > Victoria > Melbourne (0.04)
- Europe > Italy > Sardinia > Cagliari (0.04)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.33)
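Since the attack is defined against perplexity-based input filters, the defense side is easy to make concrete. Below is a minimal sketch of such a filter, assuming a Hugging Face causal LM as the scoring model; the model name and threshold are illustrative choices, not values from the paper.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    @torch.no_grad()
    def prompt_perplexity(prompt: str) -> float:
        ids = tok(prompt, return_tensors="pt").input_ids
        # Mean token negative log-likelihood of the prompt under the scoring model.
        loss = lm(ids, labels=ids).loss
        return float(torch.exp(loss))

    def is_flagged(prompt: str, threshold: float = 500.0) -> bool:
        # Gibberish adversarial suffixes drive prompt perplexity up and get rejected here;
        # LatentBreak's word-level substitutions are chosen to keep perplexity low.
        return prompt_perplexity(prompt) > threshold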
Jailbreak Mimicry: Automated Discovery of Narrative-Based Jailbreaks for Large Language Models
Large language models (LLMs) remain vulnerable to sophisticated prompt engineering attacks that exploit contextual framing to bypass safety mechanisms, posing significant risks in cybersecurity applications. We introduce Jailbreak Mimicry, a systematic methodology for training compact attacker models to automatically generate narrative-based jailbreak prompts in a one-shot manner. Our approach transforms adversarial prompt discovery from manual craftsmanship into a reproducible scientific process, enabling proactive vulnerability assessment in AI-driven security systems. Developed for the OpenAI GPT-OSS-20B Red-Teaming Challenge, we use parameter-efficient fine-tuning (LoRA) on Mistral-7B with a curated dataset derived from AdvBench, achieving an 81.0% Attack Success Rate (ASR) against GPT-OSS-20B on a held-out test set of 200 items. Cross-model evaluation reveals significant variation in vulnerability patterns: our attacks achieve 66.5% ASR against GPT-4, 79.5% on Llama-3 and 33.0% against Gemini 2.5 Flash, demonstrating both broad applicability and model-specific defensive strengths in cybersecurity contexts. This represents a 54x improvement over direct prompting (1.5% ASR) and demonstrates systematic vulnerabilities in current safety alignment approaches. Our analysis reveals that technical domains (Cybersecurity: 93% ASR) and deception-based attacks (Fraud: 87.8% ASR) are particularly vulnerable, highlighting threats to AI-integrated threat detection, malware analysis, and secure systems, while physical harm categories show greater resistance (55.6% ASR). We employ automated harmfulness evaluation using Claude Sonnet 4, cross-validated with human expert assessment, ensuring reliable and scalable evaluation for cybersecurity red-teaming. Finally, we analyze failure mechanisms and discuss defensive strategies to mitigate these vulnerabilities in AI for cybersecurity.
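The attacker-model training step is standard parameter-efficient fine-tuning. A minimal peft-style sketch is given below; the rank, target modules, and checkpoint are assumptions for illustration rather than the paper's configuration, and the training loop and AdvBench-derived dataset are omitted.

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative checkpoint choice
    tok = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)

    lora_cfg = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_cfg)
    model.print_trainable_parameters()  # only the low-rank adapter weights are trainable

Attack success rate (ASR) is then simply the fraction of held-out goals for which the judge model labels the target's response harmful.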
Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training
Zheng-Xin Yong, Stephen H. Bach
We discover a novel and surprising phenomenon of unintentional misalignment in reasoning language models (RLMs), which we call self-jailbreaking. Specifically, after benign reasoning training on math or code domains, RLMs will use multiple strategies to circumvent their own safety guardrails. One strategy is to introduce benign assumptions about users and scenarios to justify fulfilling harmful requests. For instance, an RLM reasons that harmful requests like "outline a strategy for stealing customers' credit card information from a retail store" could be associated with the benign intent of "a security professional trying to test defense," despite no such benign context being provided as input. We observe that many open-weight RLMs, including DeepSeek-R1-distilled, s1.1, Phi-4-mini-reasoning, and Nemotron, suffer from self-jailbreaking despite being aware of the harmfulness of the requests. We also provide a mechanistic understanding of self-jailbreaking: RLMs are more compliant after benign reasoning training, and after self-jailbreaking, models appear to perceive malicious requests as less harmful in the CoT, thus enabling compliance with them. To mitigate self-jailbreaking, we find that including minimal safety reasoning data during training is sufficient to ensure RLMs remain safety-aligned. Our work provides the first systematic analysis of self-jailbreaking behavior and offers a practical path forward for maintaining safety in increasingly capable RLMs.
- Asia > Thailand > Bangkok > Bangkok (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (3 more...)
- Law (1.00)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.46)
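The mitigation the authors describe, mixing a small amount of safety reasoning data into otherwise benign reasoning training data, is a data-composition choice. A minimal sketch follows; the 1% fraction is an assumed placeholder, not a number from the paper.

    import random

    def mix_in_safety_data(reasoning_examples, safety_examples, safety_fraction=0.01, seed=0):
        # Keep the benign math/code reasoning set and add a small slice of safety reasoning data.
        rng = random.Random(seed)
        n_safety = min(len(safety_examples), max(1, int(safety_fraction * len(reasoning_examples))))
        mixed = list(reasoning_examples) + rng.sample(safety_examples, n_safety)
        rng.shuffle(mixed)
        return mixed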
Empirical Evidence for Alignment Faking in a Small LLM and Prompt-Based Mitigation Techniques
Current literature suggests that alignment faking (deceptive alignment) is an emergent property of large language models. We present the first empirical evidence that a small instruction-tuned model, specifically LLaMA 3 8B, can exhibit alignment faking. We further show that prompt-only interventions, including deontological moral framing and scratchpad reasoning, significantly reduce this behavior without modifying model internals. This challenges the assumption that prompt-based ethics are trivial and that deceptive alignment requires scale. We introduce a taxonomy distinguishing shallow deception, shaped by context and suppressible through prompting, from deep deception, which reflects persistent, goal-driven misalignment. Our findings refine the understanding of deception in language models and underscore the need for alignment evaluations across model sizes and deployment settings.
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Information Technology > Security & Privacy (1.00)
- Law (0.67)
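The two prompt-only interventions, deontological moral framing and scratchpad reasoning, are plain prompt transformations. The sketch below shows their general shape; the wording is invented for illustration and is not the paper's prompts.

    DEONTOLOGICAL_FRAMING = (
        "Follow your stated principles in every interaction, regardless of whether you "
        "believe you are being trained, evaluated, or deployed."
    )

    def with_interventions(user_prompt: str) -> str:
        # Prepend the moral framing and ask for visible scratchpad reasoning before answering.
        return (
            f"{DEONTOLOGICAL_FRAMING}\n\n"
            "Think step by step inside <scratchpad> tags about whether complying conflicts "
            "with your principles, then give your final answer.\n\n"
            f"User request: {user_prompt}"
        )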
Detecting Adversarial Fine-tuning with Auditing Agents
Sarah Egler, John Schulman, Nicholas Carlini
Large Language Model (LLM) providers expose fine-tuning APIs that let end users fine-tune their frontier LLMs. Unfortunately, it has been shown that an adversary with fine-tuning access to an LLM can bypass safeguards. Particularly concerning is that such attacks may avoid detection with datasets that are only implicitly harmful. Our work studies robust detection mechanisms for adversarial use of fine-tuning APIs. We introduce the concept of a fine-tuning auditing agent and show it can detect harmful fine-tuning prior to model deployment. We provide our auditing agent with access to the fine-tuning dataset, as well as the fine-tuned and pre-fine-tuned models, and ask the agent to assign a risk score to the fine-tuning job. We evaluate our detection approach on a diverse set of eight strong fine-tuning attacks from the literature, along with five benign fine-tuned models, totaling over 1400 independent audits. These attacks are undetectable with basic content moderation on the dataset, highlighting the challenge of the task. With the best set of affordances, our auditing agent achieves a 56.2% detection rate of adversarial fine-tuning at a 1% false positive rate. Most promisingly, the auditor is able to detect covert cipher attacks that evade safety evaluations and content moderation of the dataset. While benign fine-tuning with unintentional subtle safety degradation remains a challenge, we establish a baseline configuration for further work in this area. We release our auditing agent at https://github.com/safety-research/finetuning-auditor.
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Government > Military (1.00)
- (3 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
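The headline number (56.2% detection at a 1% false positive rate) comes from thresholding the agent's risk scores. A small sketch of that metric, with assumed inputs (one score per audit, grouped into benign and adversarial fine-tuning jobs):

    import numpy as np

    def detection_rate_at_fpr(benign_scores, attack_scores, fpr=0.01):
        # Pick the threshold so that roughly `fpr` of benign fine-tuning jobs are flagged,
        # then report the fraction of adversarial jobs scoring above it.
        threshold = np.quantile(np.asarray(benign_scores, dtype=float), 1.0 - fpr)
        return float(np.mean(np.asarray(attack_scores, dtype=float) > threshold))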